Direct Pattern Matching on Compressed Text

نویسندگان

  • Edleno Silva de Moura
  • Gonzalo Navarro
  • Nivio Ziviani
  • Ricardo A. Baeza-Yates
چکیده

We present a fast compression and decompression technique for natural language texts. The novelty is that the exact search can be done on the compressed text directly, using any known sequential pattern matching algorithm. Approximate search can also be done ee-ciently without any decoding. The compression scheme uses a semi-static word-based modeling and a Huu-man coding where the coding alphabet is byte-oriented rather than bit-oriented. We use the rst bit of each byte to mark the beginning of a word, which allows the searching of the compressed pattern directly on the compressed text. We achieve about 33% compression ratio for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than agrep.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Pattern Matching Machine for Text Compressed Using Finite State Model

The classical pattern matching problem is to nd all occurrences of patterns in a text. In many practical cases, since the text is very large and stored in the secondary storage, most of the time for the pattern matching is dominated by data transmission of the text. Therefore the text compression can speed-up the pattern matching. In this framework it is required to develop an e cient pattern m...

متن کامل

Approximate Pattern Matching Over the Burrows-Wheeler Transformed Text

The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T , with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on Burrow-Wheeler transformed (BWT) text which is a critical step for a fully compressed pattern matching algorithm on a BWT based compression algorit...

متن کامل

Searching in Compressed Dictionaries

The problem of Compressed Pattern Matching , introduced by Amir and Benson [1], is of performing pattern matching directly in a compressed text without any decompressing. More formally, for a given text T , pattern P and complementary encoding and decoding functions E and D, respectively, our aim is to search for E(P ) in E(T ), rather than the usual approach which searches for the pattern P in...

متن کامل

Byte pair encoding : a text compression scheme that accelerates pattern matching

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring ...

متن کامل

Compressed Pattern Matching for SEQUITUR

Sequitur due to Nevill-Manning and Witten. [18] is a powerful program to infer a phrase hierarchy from the input text, that also provides extremely effective compression of large quantities of semi-structured text [17]. In this paper, we address the problem of searching in Sequitur compressed text directly. We show a compressed pattern matching algorithm that finds a pattern in compressed text ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998